Agenda

Announcements

  • Final Project update
  • MDSR Ch 15 programming notebook assigned
  • Beckman’s office hour cancelled on Thurs 4/4 (see syllabus for other options)
  • MDSR Ch 15 Exercises due 4/14 before midnight
  • Lots of errata/tips this chapter

MDSR Ch 15 Errata / Tips

  • Some sections don’t require programming, but please still include the headers for navigation purposes
  • p. 361: Note: data("DataSciencePapers") should appear BOTH in the front matter (per the style guide) AND as a code chunk in sequence with the other MDSR code. That way your section 15.2 results in the programming notebook will match the MDSR book, rather than reflecting the new data queried from arXiv that is also loaded on p. 361 under the object name DataSciencePapers.
  • p. 365: your word cloud won’t be identical and may even exclude “data” if it doesn’t fit on the page
  • p. 368: Wikipedia changed the tables all around. Use Table[[4]] when scraping; Title is now Song; the results won’t exactly match but you’ll be able to work it out.
  • p. 369: change 25 to 15 (the result is quite a different message!)
  • p. 370: you don’t need a Twitter account, the provided credentials work… here they are:
    • consumer_key = “u2UthjbK6YHyQSp4sPk6yjsuV”
    • consumer_secret = “sC4mjd2WME5nH1FoWeSTuSy7JCP5DHjNtTYU1X6BwQ1vPZ0j3v”
    • access_token = “1365606414-7vPfPxStYNq6kWEATQlT8HZBd4G83BBcX4VoS9T”
    • access_secret = “0hJq9KYC3eBRuZzJqSacmtJ4PNJ7tNLkGrQrVl00JHirs”
  • p. 372: you’ll need to load the RSQLite library
  • p. 373: you can skip the geocode(...) function from the ggmap package and just assign lon and lat directly. You can even look up the coordinates for State College and substitute that if you like. Oddly enough, you might have to give a credit card number to Google if you want to use geocode() so feel free to modify your code to avoid that step.
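
For the p. 373 tip, here is a minimal sketch of assigning coordinates directly instead of calling geocode(). The object name `state_college` and the rounded coordinates are assumptions for illustration only; look up your own values if you need more precision.

```r
# Assumed, approximate coordinates for State College, PA (rounded to two decimals)
state_college <- data.frame(
  lon = -77.86,   # longitude: negative values are west of the prime meridian
  lat = 40.79     # latitude: positive values are north of the equator
)
state_college
```

Any later step that expects `lon` and `lat` columns can then read them from this data frame without contacting Google.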

Road map

Project Gutenberg

Macbeth Summary

Our “muse” for this first portion is the famous play Macbeth by William Shakespeare. If you aren’t familiar with the play, here is a quick summary:

Text as data

Macbeth text data intake

macbeth_url <- "http://www.gutenberg.org/cache/epub/1129/pg1129.txt"
Macbeth_raw <- RCurl::getURL(macbeth_url)
# Macbeth_raw

Text as data

Back to Macbeth

macbeth_tmp <- strsplit(Macbeth_raw, "\r\n")
str(macbeth_tmp)
List of 1
 $ : chr [1:3193] "This Etext file is presented by Project Gutenberg, in" "cooperation with World Library, Inc., from their Library of the" "Future and Shakespeare CDROMS.  Project Gutenberg often releases" "Etexts that are NOT placed in the Public Domain!!" ...
macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]]
length(macbeth)
[1] 3193
head(macbeth)
[1] "This Etext file is presented by Project Gutenberg, in"           
[2] "cooperation with World Library, Inc., from their Library of the" 
[3] "Future and Shakespeare CDROMS.  Project Gutenberg often releases"
[4] "Etexts that are NOT placed in the Public Domain!!"               
[5] ""                                                                
[6] "*This Etext has certain copyright implications you should read!*"
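
Why the `[[1]]`? strsplit() always returns a list with one element per input string, even when there is only one input. A small sketch on a made-up string (`toy_raw` is not part of the play) shows the shape:

```r
# A made-up string using the same "\r\n" line endings as the raw download
toy_raw <- "First line\r\nSecond line\r\n\r\nFourth line"

toy_split <- strsplit(toy_raw, "\r\n")   # a list of length 1
toy_lines <- toy_split[[1]]              # the character vector inside it

class(toy_split)    # "list"
length(toy_split)   # one element, because there was one input string
length(toy_lines)   # one entry per line, including the empty line
```

Pulling out `[[1]]` right away gives an ordinary character vector that can be indexed by line number.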

The set up…

Manual text analysis

macbeth[295:310]
 [1] ""                                                          
 [2] "SCENE II."                                                 
 [3] "A camp near Forres. Alarum within."                        
 [4] ""                                                          
 [5] "Enter Duncan, Malcolm, Donalbain, Lennox, with Attendants,"
 [6] "meeting a bleeding Sergeant."                              
 [7] ""                                                          
 [8] "  DUNCAN. What bloody man is that? He can report,"         
 [9] "    As seemeth by his plight, of the revolt"               
[10] "    The newest state."                                     
[11] "  MALCOLM. This is the sergeant"                           
[12] "    Who like a good and hardy soldier fought"              
[13] "    'Gainst my captivity. Hail, brave friend!"             
[14] "    Say to the King the knowledge of the broil"            
[15] "    As thou didst leave it."                               
[16] "  SERGEANT. Doubtful it stood,"                            

Manual text analysis

Regular Expressions (RegEx)

grep( ) & grepl( )

macbeth_lines <- grep("MACBETH", macbeth)
# macbeth_lines <- grep("MACBETH", macbeth, value = TRUE)
# macbeth_lines <- grepl("MACBETH", macbeth)
length(macbeth_lines)
[1] 208
head(macbeth_lines)
[1] 218 228 230 433 443 466
identical(c(1:3), c(1L, 2L, 3L))  # are these the same?
identical(c(1:3), c(1, 2, 3))     # `identical` is PICKY

# are these the same?
identical(macbeth[grep("MACBETH", macbeth)],   
          macbeth[grepl("MACBETH", macbeth)])
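
To see the difference between the three calls without scrolling through the whole play, here is a small sketch on a made-up three-element vector (`toy` is an illustration, not the real text):

```r
toy <- c("  MACBETH. Speak, if you can.",
         "Enter Banquo.",
         "  LADY MACBETH. Give him tending;")

idx <- grep("MACBETH", toy)                 # integer positions of the matches
val <- grep("MACBETH", toy, value = TRUE)   # the matching strings themselves
lgl <- grepl("MACBETH", toy)                # one TRUE/FALSE per element

# Indexing by position or by logical vector selects the same elements
same_subset <- identical(toy[idx], toy[lgl])

# identical() also compares storage type: 1:3 is integer, c(1, 2, 3) is double
int_vs_dbl <- identical(1:3, c(1, 2, 3))
```

Here `same_subset` is TRUE while `int_vs_dbl` is FALSE, which is the sense in which identical() is picky.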

Refining our RegEx

macbeth_lines <- grep("  MACBETH.", macbeth, value = TRUE)
length(macbeth_lines)
[1] 147
head(macbeth_lines)
[1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] "  MACBETH. So foul and fair a day I have not seen."            
[3] "  MACBETH. Speak, if you can. What are you?"                   
[4] "  MACBETH. Stay, you imperfect speakers, tell me more."        
[5] "  MACBETH. Into the air, and what seem'd corporal melted"      
[6] "  MACBETH. Your children shall be kings."                      

Problem with period (.)

# anything with "MAC" and then another character
grep("MAC.", macbeth, value = TRUE) %>%
  head()
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
[3] "WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE"  
[4] "THE TRAGEDY OF MACBETH"                                           
[5] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"   
[6] "  LADY MACBETH, his wife"                                         
# MACBETH.
grep("MACBETH\\.", macbeth, value = TRUE) %>% head(20)
 [1] "  MACBETH. So foul and fair a day I have not seen."              
 [2] "  MACBETH. Speak, if you can. What are you?"                     
 [3] "  MACBETH. Stay, you imperfect speakers, tell me more."          
 [4] "  MACBETH. Into the air, and what seem'd corporal melted"        
 [5] "  MACBETH. Your children shall be kings."                        
 [6] "  MACBETH. And Thane of Cawdor too. Went it not so?"             
 [7] "  MACBETH. The Thane of Cawdor lives. Why do you dress me"       
 [8] "  MACBETH. [Aside.] Glamis, and Thane of Cawdor!"                
 [9] "  MACBETH. [Aside.] Two truths are told,"                        
[10] "  MACBETH. [Aside.] If chance will have me King, why, chance may"
[11] "  MACBETH. [Aside.] Come what come may,"                         
[12] "  MACBETH. Give me your favor; my dull brain was wrought"        
[13] "  MACBETH. Till then, enough. Come, friends.             Exeunt."
[14] "  MACBETH. The service and the loyalty lowe,"                    
[15] "  MACBETH. The rest is labor, which is not used for you."        
[16] "  MACBETH. [Aside.] The Prince of Cumberland! That is a step"    
[17] "  LADY MACBETH. \"They met me in the day of success, and I have" 
[18] "  LADY MACBETH. Thou'rt mad to say it!"                          
[19] "  LADY MACBETH. Give him tending;"                               
[20] "  MACBETH. My dearest love,"                                     
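
A compact sketch of the same problem on two made-up strings. An unescaped `.` matches any single character, so it matches the comma too; escaping it, wrapping it in brackets, or using `fixed = TRUE` all force a literal period.

```r
toy <- c("  MACBETH, Thane of Glamis", "  MACBETH. So foul and fair")

unescaped <- grepl("MACBETH.", toy)                # "." matches "," and "."
escaped   <- grepl("MACBETH\\.", toy)              # literal period only
bracketed <- grepl("MACBETH[.]", toy)              # same idea, no backslashes
literal   <- grepl("MACBETH.", toy, fixed = TRUE)  # pattern treated as plain text
```

The last three are interchangeable here; which one to use is mostly a matter of readability.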

Simple RegEx Tools

# alternation with `|` inside a group (within [ ], `|` would be a literal character)
grep("MAC(B|D)", macbeth, value = TRUE) %>% head()
[1] "THE TRAGEDY OF MACBETH"                                        
[2] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[3] "  LADY MACBETH, his wife"                                      
[4] "  MACDUFF, Thane of Fife, a nobleman of Scotland"              
[5] "  LADY MACDUFF, his wife"                                      
[6] "  MACBETH. So foul and fair a day I have not seen."            
# "MAC" followed by any capital letter from "C" through "Z"
grep("MAC[C-Z]", macbeth, value = TRUE) %>% head(10)
 [1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
 [2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
 [3] "WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE"  
 [4] "  MACDUFF, Thane of Fife, a nobleman of Scotland"                 
 [5] "  LADY MACDUFF, his wife"                                         
 [6] "WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE"  
 [7] "WITH PERMISSION.  ELECTRONIC AND MACHINE READABLE COPIES MAY BE"  
 [8] "  MACDUFF. Was it so late, friend, ere you went to bed,"          
 [9] "  MACDUFF. What three things does drink especially provoke?"      
[10] "  MACDUFF. I believe drink gave thee the lie last night."         
# search for beginning of each act in the play (`^` goes at beginning)
grep("ACT", macbeth, value = TRUE) %>% head(10)
[1] "BREACH OF WARRANTY OR CONTRACT, INCLUDING BUT NOT LIMITED TO"
[2] "ACT I. SCENE I."                                             
[3] "ACT II. SCENE I."                                            
[4] "ACT III. SCENE I."                                           
[5] "ACT IV. SCENE I."                                            
[6] "ACT V. SCENE I."                                             
grep("^ACT", macbeth, value = TRUE) %>% head(10)
[1] "ACT I. SCENE I."   "ACT II. SCENE I."  "ACT III. SCENE I." "ACT IV. SCENE I."  "ACT V. SCENE I."  
# strings strictly ending in "MACBETH" (`$` goes at end)
grep("MACBETH$", macbeth, value = TRUE) %>% head(10)
[1] "THE TRAGEDY OF MACBETH"
# repetitions
grep("^ ?MAC", macbeth, value = TRUE) %>% head()  # zero or one leading spaces
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
grep("^ *MAC", macbeth, value = TRUE) %>% head()  # zero or more leading spaces
[1] "MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES"
[2] "MACHINE READABLE COPIES OF THIS ETEXT, SO LONG AS SUCH COPIES"    
[3] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"   
[4] "  MACDUFF, Thane of Fife, a nobleman of Scotland"                 
[5] "  MACBETH. So foul and fair a day I have not seen."               
[6] "  MACBETH. Speak, if you can. What are you?"                      
grep("^ +MAC", macbeth, value = TRUE) %>% head()  # one or more leading spaces
[1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] "  MACDUFF, Thane of Fife, a nobleman of Scotland"              
[3] "  MACBETH. So foul and fair a day I have not seen."            
[4] "  MACBETH. Speak, if you can. What are you?"                   
[5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
[6] "  MACBETH. Into the air, and what seem'd corporal melted"      
grep("^ {2}MAC", macbeth, value = TRUE) %>% head()  # exactly two leading spaces
[1] "  MACBETH, Thane of Glamis and Cawdor, a general in the King's"
[2] "  MACDUFF, Thane of Fife, a nobleman of Scotland"              
[3] "  MACBETH. So foul and fair a day I have not seen."            
[4] "  MACBETH. Speak, if you can. What are you?"                   
[5] "  MACBETH. Stay, you imperfect speakers, tell me more."        
[6] "  MACBETH. Into the air, and what seem'd corporal melted"      
grep("^ {3}MAC", macbeth, value = TRUE) %>% head()  # exactly three leading spaces
character(0)
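
One detail worth checking by hand: square brackets define a set of single characters, so a `|` placed inside them is just another member of the set, while parentheses give true alternation. A sketch on made-up strings (the string "MAC|X" is invented to expose the difference):

```r
toy <- c("MACBETH", "MACDUFF", "MAC|X", "MACHINE")

in_class   <- grepl("MAC[B|D]", toy)   # set of three characters: B, |, D
class_only <- grepl("MAC[BD]", toy)    # set of two characters: B, D
grouped    <- grepl("MAC(B|D)", toy)   # alternation: B or D
```

On the play itself all three patterns return the same lines, because the text never contains "MAC|"; the difference only shows up on input like the third toy string.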

Speaker frequency in Macbeth

Macbeth <- grepl("  MACBETH\\.", macbeth)
Macduff <- grepl("  MACDUFF\\.", macbeth)
LadyMacbeth <- grepl("  LADY MACBETH\\.", macbeth)
LadyMacduff <- grepl("  LADY MACDUFF\\.", macbeth)
Banquo <- grepl("  BANQUO\\.", macbeth)
Duncan <- grepl("  DUNCAN\\.", macbeth)
speaker_freq <- data.frame(Macbeth, Macduff, LadyMacbeth, LadyMacduff, Banquo, Duncan) %>%
  mutate(line = 1:length(macbeth)) %>%
  gather(key = "character", value = "speak", -line) %>%
  mutate(speak = as.numeric(speak)) %>%
  filter(line > 218 & line < 3172)
glimpse(speaker_freq)
Observations: 17,718
Variables: 3
$ line      <int> 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238,…
$ character <chr> "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", "Macbeth", …
$ speak     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# find the acts that subdivide the play
acts_idx <- grep("^ACT ", macbeth)
acts_labels <- str_extract(macbeth[acts_idx], "^ACT [IV]+")  # [IV]+ matches I, II, III, IV, V
acts <- data.frame(line = acts_idx, labels = acts_labels)
# plot speaker frequencies
speaker_freq %>%
  ggplot(aes(x = line, y = speak)) + 
  geom_smooth(aes(color = character), method = "loess", se = FALSE) + 
  geom_vline(xintercept = acts_idx, color = "darkgray", lty = 3) + 
  geom_text(data = acts, aes(y = 0.085, label = labels), 
            hjust = "left", color = "darkgray") + 
  ylim(c(0, NA)) + 
  xlab("Line Number") + 
  ylab("Proportion of Speeches")
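
Before smoothing, it can help to sanity-check the raw counts. A minimal base R sketch, using a made-up five-line excerpt (`toy_play`) and anchoring each pattern at the start of the line, counts how many speeches each character begins:

```r
toy_play <- c("  MACBETH. So foul and fair a day I have not seen.",
              "    continuation of the same speech",
              "  BANQUO. How far is't call'd to Forres?",
              "  MACBETH. Speak, if you can.",
              "  LADY MACBETH. Give him tending;")

speakers <- c("MACBETH", "LADY MACBETH", "BANQUO")
patterns <- paste0("^  ", speakers, "\\.")   # two spaces, NAME, literal period

# sum() of a logical vector counts the TRUE values
speech_counts <- vapply(patterns,
                        function(p) sum(grepl(p, toy_play)),
                        integer(1))
names(speech_counts) <- speakers
speech_counts
```

Note that the MACBETH pattern does not pick up LADY MACBETH, because only one space separates "LADY" from "MACBETH" while the pattern requires two.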

Further Analysis

Identifying lines & speakers

lines <- strsplit(Macbeth_raw, "\r\n {2}[A-Z 0-9]*\\. ")[[1]][-1]
speakers <- stringr::str_extract_all(Macbeth_raw, " {2}[A-Z 0-9]*\\. ")[[1]]
AllMacbeth <- data.frame(speakers, lines)

Clean up the lines

head(AllMacbeth)

Clean up the lines

AllMacbeth <- 
  AllMacbeth %>%
  mutate(lines = gsub(pattern = "[\r\n]+", replacement = "", x = lines))
head(AllMacbeth)
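
A variation you might prefer, sketched on a made-up two-line speech: replace each run of line-break characters with a single space, then collapse repeated whitespace. This is an alternative under the assumption that you want evenly spaced text; it is not what the pipeline above does.

```r
toy_line <- "When shall we three meet again?\r\n    In thunder, lightning, or in rain?\r\n"

no_breaks <- gsub("[\r\n]+", " ", toy_line)          # line breaks become spaces
tidy_line <- trimws(gsub("\\s+", " ", no_breaks))    # squeeze runs of whitespace
tidy_line
```

The result is a single line with exactly one space between words and no trailing whitespace.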

More clean up

<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS PROVIDED BY PROJECT GUTENBERG ETEXT OF CARNEGIE MELLON UNIVERSITY WITH PERMISSION. ELECTRONIC AND MACHINE READABLE COPIES MAY BE DISTRIBUTED SO LONG AS SUCH COPIES (1) ARE FOR YOUR OR OTHERS PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED COMMERCIALLY. PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>

Another gsub

junk <- grepl(pattern = "<<.*>>", x = AllMacbeth$lines)
# Before
AllMacbeth %>%
  filter(junk) %>%
  select(speakers, lines)
# Correction
AllMacbeth <- 
  AllMacbeth %>%
  mutate(lines = gsub(pattern = "<<.*>>", replacement = "", x = lines))
# After
AllMacbeth %>%
  filter(junk) %>%
  select(speakers, lines)
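
A caution about `.*`: it is greedy, so if a single string held two notices, everything between the first `<<` and the last `>>` would be removed, including any real text in between. The sketch below uses an invented string with two notices to show the difference; the lazy form `.*?` is run with `perl = TRUE` here.

```r
toy <- "Exit. <<COPYRIGHT NOTICE>> Enter Macbeth. <<SECOND NOTICE>> Exeunt."

greedy <- gsub("<<.*>>", "", toy)                  # first << through last >>
lazy   <- gsub("<<.*?>>", "", toy, perl = TRUE)    # each notice separately

grepl("Enter Macbeth", greedy)   # the stage direction was swallowed
grepl("Enter Macbeth", lazy)     # the stage direction survives
```

With one notice per string the two patterns behave the same, so it is worth checking how many notices a string contains before relying on the greedy form.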

Time to take a look!

First attempt: Counting words

countWords <- function(line) {
  return(length(strsplit(x = line, split = "\\s+")))
}
AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)
head(AllMacbeth)

Debugging our cleanup

strsplit(x = AllMacbeth$lines[1], split = "\\s+")
[[1]]
 [1] "When"       "shall"      "we"         "three"      "meet"       "again?"     "In"         "thunder,"  
 [9] "lightning," "or"         "in"         "rain?"     

Debugging our cleanup

countWords <- function(line) {
  return(length(strsplit(x = line, split = "\\s+")[[1]]))
}
AllMacbeth$nWords <- sapply(X = AllMacbeth$lines, FUN = countWords)
head(AllMacbeth)
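
The bug in the first attempt comes down to what length() is measuring. A short sketch on a made-up line shows both versions side by side, plus lengths(), a vectorised base R alternative that avoids sapply() altogether:

```r
line <- "When shall we three meet again?"

wrong <- length(strsplit(line, "\\s+"))        # length of the list: always 1
right <- length(strsplit(line, "\\s+")[[1]])   # length of the word vector

# lengths() reports the length of every list element in one call
both <- lengths(strsplit(c(line, "Hail!"), "\\s+"))
```

One caveat: a string with leading whitespace produces an empty first piece, so applying trimws() before splitting may be worthwhile.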

Choosing your character

# Longest individual lines 
AllMacbeth %>%
  select(speakers, nWords, lines) %>%
  arrange(desc(nWords)) 

Character analysis

MacbethSummary <- 
  AllMacbeth %>% 
  group_by(speakers) %>%
  summarise(lines = n(), 
            totalWords = sum(nWords), 
            wordsPerLine = totalWords/lines)
# Most words per line
MacbethSummary %>%
  arrange(desc(wordsPerLine)) %>%
  head(10)
# Most total content
MacbethSummary %>%
  arrange(desc(totalWords)) %>%
  head(10)

Who do you choose?

MyLines <- 
  AllMacbeth %>% 
  filter(grepl("  LADY MACBETH\\.", speakers))
head(MyLines)

Text Mining Macbeth

Text Analysis

require(readtext)
ModernMacbeth <- 
  readtext::readtext(file = "/Users/mattbeckman/Documents/GitHub/Teaching/STAT-380/2019 Spring/ClassNotes/15-mdsr/plain_Macbeth/*.txt", docvarsfrom = "filenames", dvsep = "_") %>%
  as_tibble()
head(ModernMacbeth)

Corpus

Modern Macbeth Corpus

Tidy Text Format

Tokenization step

require(tidytext)
ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text)         # single word tokenization
  # unnest_tokens(output = word, input = text, token = "ngrams", n = 3)   # n-gram tokenization
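
To build intuition for what tokenization produces, here is a rough base R approximation on a single made-up sentence. It is a sketch only: it lowercases, strips punctuation, and splits on whitespace, which is simpler than the tokenizer tidytext actually uses.

```r
text <- "Fair is foul, and foul is fair."

# single-word tokens: lowercase, drop punctuation, split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+")[[1]]

# n-grams: slide a window of n consecutive words along the vector
n <- 3
starts <- seq_len(length(words) - n + 1)
trigrams <- vapply(starts,
                   function(i) paste(words[i:(i + n - 1)], collapse = " "),
                   character(1))
words
trigrams
```

A sentence of k words yields k - n + 1 n-grams, so n-gram tables grow nearly as fast as word tables but with far more distinct values.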

Bag of words

ModernMacbeth_tidy <- 
  ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text)         # single word tokenization

Token frequency

ModernMacbeth_tidy %>%
  count(word, sort = TRUE)

Stop words

# native text
ModernMacbeth_tidy %>%
  count(word, sort = TRUE)
# load stop word list (English)
data("stop_words")   # from `tidytext` package
head(stop_words)
tail(stop_words)

Removing stop words

# # `stop_words` has two columns: "word" and "lexicon"
# stop_words <-
#   rbind(stop_words,           
#         c("macbeth", "custom"))
# 
# stop_words %>%
#   filter(word == "it's")
# 
# # we'll need to use RegEx to clean these up
# grep(pattern = "it.s", x = ModernMacbeth_tidy$word, value = TRUE)
# grep(pattern = "’", x = ModernMacbeth_tidy$word, value = TRUE) %>% head(30)
ModernMacbeth_tidy <- 
  ModernMacbeth_tidy %>%
  # mutate(word = gsub(pattern = , replacement = , x = word)) %>%
  filter(!(word %in% stop_words$word))
ModernMacbeth_tidy %>%
  count(word, sort = TRUE)
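
The commented-out hints above point at a subtle issue: a curly apostrophe (’) and a straight one (') are different characters, so "it’s" will not match a stop word spelled "it's". A base R sketch on invented tokens and an invented three-word stop list:

```r
toy_words <- c("it\u2019s", "the", "dagger", "it's", "macbeth")   # \u2019 is ’
toy_stop  <- c("the", "it's", "macbeth")                          # made-up stop list

# without normalising, the curly-apostrophe token slips through
kept_without_fix <- toy_words[!(toy_words %in% toy_stop)]

# normalise curly apostrophes to straight ones, then filter
normalised <- gsub("\u2019", "'", toy_words, fixed = TRUE)
kept <- normalised[!(normalised %in% toy_stop)]
```

After normalising, only "dagger" remains; without it, "it’s" stays in the data and shows up in the word counts.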

Tidy text pipes to ggplot2

ModernMacbeth_tidy %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Word clouds

require(wordcloud)   # provides wordcloud() and comparison.cloud()
ModernMacbeth_tidy %>%
  filter(!(word %in% stop_words$word)) %>%
  count(word) %>%
  with(., wordcloud(word, n, max.words = 45))

word clouds: Modern vs original

AllMacbeth_tidy <- 
  AllMacbeth %>%
  select(speakers, lines) %>%
  unnest_tokens(output = word, input = lines)     # single word tokenization
head(AllMacbeth_tidy)
# helper function for making wordclouds
macbeth_wordcloud <- function(Corpus_tidy, maxWords = 45) {
  Corpus_tidy %>%
    filter(!(word %in% stop_words$word)) %>%
    count(word) %>%
    with(., wordcloud(word, n, max.words = maxWords))
}
macbeth_wordcloud(Corpus_tidy = AllMacbeth_tidy)

macbeth_wordcloud(Corpus_tidy = ModernMacbeth_tidy)

Act by Act word clouds

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act1"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act2"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act3"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act4"))

macbeth_wordcloud(filter(ModernMacbeth_tidy, act == "Act5"))

Recall

Sentiment

require(tidytext)
data("sentiments")
head(sentiments, 10)
tail(sentiments, 10)

Sentiment lexicons

get_sentiments("nrc") %>%
  group_by(sentiment) %>%
  summarise(N = n()) %>%
  arrange(desc(N))

Sentiment analysis of Macbeth

nrc_anticipation <- get_sentiments("nrc") %>%
  filter(sentiment == "anticipation")
ModernMacbeth_tidy %>%
  inner_join(nrc_anticipation) %>%
  count(word, sort = TRUE)
Joining, by = "word"

Changes in Sentiment

ModernMacbeth_sentiment <- 
  ModernMacbeth_tidy %>%
  mutate(act_scene = paste(act, scene, sep = "_")) %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(act, scene) %>%
  summarise(sentiment = mean(score, na.rm = TRUE))   # newer tidytext releases name this AFINN column `value`
Joining, by = "word"
head(ModernMacbeth_sentiment)
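
The mechanics of this step are an inner join followed by a grouped mean. Here is a base R sketch on invented tokens with a tiny made-up lexicon; the scores below are for illustration only and are not taken from AFINN.

```r
toy_tokens <- data.frame(
  scene = c(1, 1, 1, 2, 2),
  word  = c("good", "murder", "king", "love", "happy"),
  stringsAsFactors = FALSE
)

# made-up scores standing in for a sentiment lexicon
toy_lexicon <- data.frame(
  word  = c("good", "murder", "love", "happy"),
  value = c(3, -2, 3, 3),
  stringsAsFactors = FALSE
)

# inner join: words missing from the lexicon ("king") are dropped
joined <- merge(toy_tokens, toy_lexicon, by = "word")

# mean score per scene
by_scene <- aggregate(value ~ scene, data = joined, FUN = mean)
by_scene
```

Because unscored words drop out of the join, a scene's average reflects only the words the lexicon knows about, which matters for short scenes.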

Plot changes in sentiment by Act

ModernMacbeth_sentiment %>%
  rownames_to_column() %>%
  mutate(rowname = parse_number(rowname)) %>%
  ggplot(aes(x = rowname, y = sentiment, fill = act)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  ggtitle("Sentiment Analysis of each Act & Scene in Modern Macbeth")

Common positive and negative words

bing_word_counts <- 
  ModernMacbeth_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
Joining, by = "word"
bing_word_counts

Plotting word frequency by sentiment

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
Selecting by n

Wordclouds with feeling

ModernMacbeth_tidy %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 45))
require(reshape2)
ModernMacbeth_tidy %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"), max.words = 45)
Joining, by = "word"

Positive scenes in Macbeth

bingpositive <- get_sentiments("bing") %>% 
  filter(sentiment == "positive")
wordcounts <- ModernMacbeth_tidy %>%
  group_by(act, scene) %>%
  summarize(words = n())
ModernMacbeth_tidy %>%
  semi_join(bingpositive) %>%
  group_by(act, scene) %>%
  summarize(positivewords = n()) %>%
  left_join(wordcounts, by = c("act", "scene")) %>%
  mutate(ratio = positivewords / words) %>%         # ratio of positive words
  filter(scene != 0) %>%
  top_n(1) %>%
  ungroup()
Joining, by = "word"
Selecting by ratio

Analysis of word & document frequency

Document term matrices

\[idf(t, D) = \text{ln}\frac{|D|}{|\{d \in D : t \in d\}|}\]
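
A worked example of the formula in base R, using invented counts for three "acts". Here tf is a word's share of its act, idf follows the natural-log definition above, and tf-idf is their product. All numbers are made up.

```r
counts <- data.frame(
  act  = c("Act1", "Act1", "Act2", "Act2", "Act3"),
  word = c("king", "witch", "king", "dagger", "king"),
  n    = c(2, 2, 3, 1, 4),
  stringsAsFactors = FALSE
)

# tf: count divided by the total number of words in that act
act_totals <- tapply(counts$n, counts$act, sum)
counts$tf <- counts$n / as.numeric(act_totals[counts$act])

# idf: ln(number of documents / number of documents containing the term)
n_docs   <- length(unique(counts$act))
doc_freq <- table(counts$word)   # each (act, word) pair appears once
counts$idf <- log(n_docs / as.numeric(doc_freq[counts$word]))

counts$tf_idf <- counts$tf * counts$idf
counts
```

"king" appears in every act, so its idf is ln(3/3) = 0 and its tf-idf is zero no matter how often it occurs; "witch" and "dagger" each appear in one act and get idf = ln(3).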

Term frequency in Acts of Macbeth

# tf within each act
act_words <- 
  ModernMacbeth %>%
  select(text, act = docvar1, scene = docvar2) %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  count(act, word, sort = TRUE)
# total words in the act
total_words <- 
  act_words %>% 
  group_by(act) %>% 
  summarize(total = sum(n))
act_words <- left_join(act_words, total_words)
Joining, by = "act"
act_words

Term frequency in Macbeth

act_words %>% 
  ggplot(aes(n / total, fill = act)) +
  geom_histogram(show.legend = FALSE) +
  facet_wrap(~ act, ncol = 2, scales = "free_y") + 
  ggtitle("Term frequency in Acts of Macbeth")

Zipf’s law

freq_by_rank <- 
  act_words %>% 
  group_by(act) %>% 
  mutate(rank = row_number(),   # since already ordered by `n`
         `term frequency` = n / total)
freq_by_rank

Visualization of Zipf’s law

freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = act)) + 
  geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) + 
  scale_x_log10() +
  scale_y_log10() + 
  ggtitle("Zipf's law for acts of Macbeth")

Investigating term frequency in Macbeth

rank_subset <- freq_by_rank %>% 
  filter(rank < 200,
         rank > 10)
lm(log10(`term frequency`) ~ log10(rank), data = rank_subset)

Call:
lm(formula = log10(`term frequency`) ~ log10(rank), data = rank_subset)

Coefficients:
(Intercept)  log10(rank)  
     -0.946       -0.909  
freq_by_rank %>% 
  ggplot(aes(rank, `term frequency`, color = act)) + 
  geom_abline(intercept = -0.946, slope = -0.909, color = "gray50", linetype = 2) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = TRUE) + 
  scale_x_log10() +
  scale_y_log10() + 
  ggtitle("Zipf's law for Modern Macbeth")
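
For comparison, here is what the regression returns on synthetic frequencies that follow Zipf's law exactly (frequency proportional to 1/rank). The data are simulated, not taken from the play.

```r
rank <- 1:200
freq <- (1 / rank) / sum(1 / rank)   # proportional to 1/rank, sums to 1

fit <- lm(log10(freq) ~ log10(rank))
zipf_slope <- unname(coef(fit)[2])
zipf_slope
```

An ideal Zipf distribution gives a slope of -1 on the log-log scale, so the fitted slope of about -0.9 for Modern Macbeth is close to, but slightly shallower than, the textbook case.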

tf-idf

act_words <- 
  act_words %>%
  bind_tf_idf(word, act, n)
act_words

High tf-idf in Macbeth

act_words %>%
  select(-total) %>%
  arrange(desc(tf_idf))

Visualizing high tf-idf words

act_words %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>% 
  group_by(act) %>% 
  top_n(10) %>% 
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = act)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~act, ncol = 2, scales = "free") +
  coord_flip()
Selecting by tf_idf

Note about converting to/from other non-tidy formats

Document-term matrix

Topic Modeling

Latent Dirichlet allocation (LDA)

Preparing for LDA

data("AssociatedPress", package = "topicmodels")
AssociatedPress
<<DocumentTermMatrix (documents: 2246, terms: 10473)>>
Non-/sparse entries: 302031/23220327
Sparsity           : 99%
Maximal term length: 18
Weighting          : term frequency (tf)

Two-topic Latent Dirichlet allocation (LDA) model

require(topicmodels)
ap_lda <- 
  AssociatedPress %>%
  LDA(k = 2, control = list(seed = 1234))  
ap_lda
A LDA_VEM topic model with 2 topics.

Per-topic-per-word probabilities

ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics

Top 10 per-topic-per-word probabilities (\(\beta\))

ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

Words with greatest difference in \(\beta\) between topics

beta_spread <- ap_topics %>%
  mutate(topic = paste0("topic", topic)) %>%
  spread(topic, beta) %>%
  filter(topic1 > .001 | topic2 > .001) %>%
  mutate(log_ratio = log2(topic2 / topic1)) %>%
  arrange(desc(log_ratio))
beta_spread
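
Why a log ratio? It makes the comparison symmetric: a word four times more likely in topic 2 scores +2, a word four times more likely in topic 1 scores -2, and a word equally likely in both scores 0. A sketch with invented terms and probabilities:

```r
toy_beta <- data.frame(
  term   = c("stock", "president", "people"),   # made-up terms
  topic1 = c(0.004, 0.001, 0.002),              # made-up probabilities
  topic2 = c(0.001, 0.004, 0.002),
  stringsAsFactors = FALSE
)

toy_beta$log_ratio <- log2(toy_beta$topic2 / toy_beta$topic1)
toy_beta[order(-toy_beta$log_ratio), ]
```

A plain ratio would map those same three words to 0.25, 4, and 1, which hides the fact that the first two differ by the same factor.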

Per-document classification

ap_gamma <- tidy(ap_lda, matrix = "gamma")
ap_gamma %>% 
  arrange(document, gamma)

Let’s investigate a few interesting documents

# investigate document 3 (change the filter to 6 to inspect document 6)
tidy(AssociatedPress) %>%
  filter(document == 3) %>%
  arrange(desc(count))

Modeling Topics as Acts in Macbeth…?

# convert to Document Term Matrix
ModernMacbeth_DTM <- 
  ModernMacbeth %>%
  mutate(act_scene = gsub(pattern = "\\.txt", replacement = "", x = doc_id)) %>%
  unnest_tokens(output = word, input = text) %>%
  mutate(word = gsub(pattern = "’", replacement = "'", x = word)) %>%
  anti_join(stop_words, by = "word") %>%     
  count(act_scene, word, sort = TRUE) %>%   
  cast_dtm(document = act_scene, term = word, value = n)
# LDA topic model (5 topics)
scenes_lda <- 
  ModernMacbeth_DTM %>%
  LDA(k = 5, control = list(seed = 380))  
# Per-topic-per-word probabilities 
scene_topics <- tidy(scenes_lda, matrix = "beta")
scene_topics
# Visualize top per-topic-per-word probabilities
top_terms <- 
  scene_topics %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

# Per-document classification
scenes_gamma <- tidy(scenes_lda, matrix = "gamma")
scenes_gamma %>% 
  arrange(document, gamma)
# Topics vs Acts?
scenes_gamma <- 
  scenes_gamma %>%
  separate(col = document, into = c("act", "scene"), sep = "_", convert = TRUE)
scenes_gamma %>%
  ggplot(aes(factor(topic), gamma)) +
  geom_boxplot() +
  facet_wrap(~ act)
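
One way to summarise the gamma matrix is to assign each document to the topic with the largest gamma. A base R sketch on invented gamma values for two hypothetical scenes:

```r
toy_gamma <- data.frame(
  document = c("Act1_1", "Act1_1", "Act1_2", "Act1_2"),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.9, 0.1, 0.3, 0.7),   # made-up values; each document sums to 1
  stringsAsFactors = FALSE
)

# for each document, keep the row with the largest gamma
top_rows <- do.call(
  rbind,
  lapply(split(toy_gamma, toy_gamma$document),
         function(d) d[which.max(d$gamma), ])
)
top_rows
```

If topics lined up with acts, scenes from the same act would mostly share a top topic; mixed assignments within an act suggest the topics capture something else.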

Scene alignment to topics

Road map (again)

